49 research outputs found

    Scalable Facility Location for Massive Graphs on Pregel-like Systems

    Full text link
    We propose a new scalable algorithm for facility location. Facility location is a classic problem, where the goal is to select a subset of facilities to open, from a set of candidate facilities F , in order to serve a set of clients C. The objective is to minimize the total cost of opening facilities plus the cost of serving each client from the facility it is assigned to. In this work, we are interested in the graph setting, where the cost of serving a client from a facility is represented by the shortest-path distance on the graph. This setting allows to model natural problems arising in the Web and in social media applications. It also allows to leverage the inherent sparsity of such graphs, as the input is much smaller than the full pairwise distances between all vertices. To obtain truly scalable performance, we design a parallel algorithm that operates on clusters of shared-nothing machines. In particular, we target modern Pregel-like architectures, and we implement our algorithm on Apache Giraph. Our solution makes use of a recent result to build sketches for massive graphs, and of a fast parallel algorithm to find maximal independent sets, as building blocks. In so doing, we show how these problems can be solved on a Pregel-like architecture, and we investigate the properties of these algorithms. Extensive experimental results show that our algorithm scales gracefully to graphs with billions of edges, while obtaining values of the objective function that are competitive with a state-of-the-art sequential algorithm

    EviDense: a Graph-based Method for Finding Unique High-impact Events with Succinct Keyword-based Descriptions

    Get PDF
    International audienceDespite the significant efforts made by the research community in recent years, automatically acquiring valuable information about high impact-events from social media remains challenging. We present EVIDENSE, a graph-based approach for finding high-impact events (such as disaster events) in social media. Our evaluation shows that our method outper-forms state-of-the-art approaches for the same problem, in terms of having higher precision, lower number of duplicates, while providing a keyword-based description that is succinct and informative

    Listing k-cliques in Sparse Real-World Graphs

    Get PDF
    International audienceMotivated by recent studies in the data mining community which require to efficiently list all k-cliques, we revisit the iconic algorithm of Chiba and Nishizeki and develop the most efficient parallel algorithm for such a problem. Our theoretical analysis provides the best asymptotic upper bound on the running time of our algorithm for the case when the input graph is sparse. Our experimental evaluation on large real-world graphs shows that our parallel algorithm is faster than state-of-the-art algorithms, while boasting an excellent degree of parallelism. In particular, we are able to list all k-cliques (for any k) in graphs containing up to tens of millions of edges as well as all 10-cliques in graphs containing billions of edges, within a few minutes and a few hours respectively. Finally, we show how our algorithm can be employed as an effective subroutine for finding the k-clique core decomposition and an approximate k-clique densest subgraphs in very large real-world graphs

    Fully Dynamic k -Center Clustering

    Get PDF
    International audienceStatic and dynamic clustering algorithms are a fundamental tool in any machine learning library. Most of the efforts in developing dynamic machine learning and data mining algorithms have been focusing on the sliding window model (where at any given point in time only the most recent data items are retained) or more simplistic models. However, in many real-world applications one might need to deal with arbitrary deletions and insertions. For example, one might need to remove data items that are not necessarily the oldest ones, because they have been flagged as containing inappropriate content or due to privacy concerns. Clustering trajectory data might also require to deal with more general update operations. We develop a (2 +)-approximation algorithm for the k-center clustering problem with "small" amortized cost under the fully dynamic adversarial model. In such a model, points can be added or removed arbitrarily, provided that the adversary does not have access to the random choices of our algorithm. The amortized cost of our algorithm is poly-logarithmic when the ratio between the maximum and minimum distance between any two points in input is bounded by a polynomial, while k and are constant. Our theoretical results are complemented with an extensive experimental evaluation on dynamic data from Twitter, Flickr, as well as trajectory data, demonstrating the effectiveness of our approach

    Revisiting Opinion Dynamics with Varying Susceptibility to Persuasion via Non-Convex Local Search

    Get PDF
    International audienceWe revisit the opinion susceptibility problem that was proposed by Abebe et al. [1], in which agents influence one another's opinions through an iterative process. Each agent has some fixed innate opinion. In each step, the opinion of an agent is updated to some convex combination between its innate opinion and the weighted average of its neighbors' opinions in the previous step. The resistance of an agent measures the importance it places on its innate opinion in the above convex combination. Under non-trivial conditions, this iterative process converges to some equilibrium opinion vector. For the unbudgeted variant of the problem, the goal is to select the resistance of each agent (from some given range) such that the sum of the equilibrium opinions is minimized. Contrary to the claim in the aforementioned KDD 2018 paper, the objective function is in general non-convex. Hence, formulating the problem as a convex program might have potential correctness issues. We instead analyze the structure of the objective function, and show that any local optimum is also a global optimum, which is somehow surprising as the objective function might not be convex. Furthermore, we combine the iterative process and the local search paradigm to design very efficient algorithms that can solve the unbudgeted variant of the problem optimally on large-scale graphs containing millions of nodes

    Finding events in temporal networks: Segmentation meets densest-subgraph discovery

    Get PDF
    International audienceIn this paper we study the problem of discovering a timeline of events in a temporal network. We model events as dense subgraphs that occur within intervals of network activity. We formulate the event-discovery task as an optimization problem, where we search for a partition of the network timeline into k non-overlapping intervals, such that the intervals span subgraphs with maximum total density. The output is a sequence of dense subgraphs along with corresponding time intervals, capturing the most interesting events during the network lifetime. A naĂŻve solution to our optimization problem has polynomial but prohibitively high running time complexity. We adapt existing recent work on dynamic densest-subgraph discovery and approximate dynamic programming to design a fast approximation algorithm. Next, to ensure richer structure, we adjust the problem formulation to encourage coverage of a larger set of nodes. This problem is NP-hard even for static graphs. However, on static graphs a simple greedy algorithm leads to approximate solution due to submodularity. We extended this greedy approach for the case of temporal networks. However, the approximation guarantee does not hold. Nevertheless, according to the experiments, the algorithm finds good quality solutions

    Finding events in temporal networks : segmentation meets densest subgraph discovery

    Get PDF
    In this paper, we study the problem of discovering a timeline of events in a temporal network. We model events as dense subgraphs that occur within intervals of network activity. We formulate the event discovery task as an optimization problem, where we search for a partition of the network timeline into k non-overlapping intervals, such that the intervals span subgraphs with maximum total density. The output is a sequence of dense subgraphs along with corresponding time intervals, capturing the most interesting events during the network lifetime. A naĂŻve solution to our optimization problem has polynomial but prohibitively high running time. We adapt existing recent work on dynamic densest subgraph discovery and approximate dynamic programming to design a fast approximation algorithm. Next, to ensure richer structure, we adjust the problem formulation to encourage coverage of a larger set of nodes. This problem is NP-hard; however, we show that on static graphs a simple greedy algorithm leads to approximate solution due to submodularity. We extend this greedy approach for temporal networks, but we lose the approximation guarantee in the process. Finally, we demonstrate empirically that our algorithms recover solutions with good quality.Peer reviewe

    AutoML: state of the art with a focus on anomaly detection, challenges, and research directions

    Get PDF
    International audienceThe last decade has witnessed the explosion of machine learning research studies with the inception of several algorithms proposed and successfully adopted in different application domains. However, the performance of multiple machine learning algorithms is very sensitive to multiple ingredients (e.g., hyper-parameters tuning and data cleaning) where a significant human effort is required to achieve good results. Thus, building well-performing machine learning algorithms requires domain knowledge and highly specialized data scientists. Automated machine learning (autoML) aims to make easier and more accessible the use of machine learning algorithms for researchers with varying levels of expertise. Besides, research effort to date has mainly been devoted to autoML for supervised learning, and only a few research proposals have been provided for the unsupervised learning. In this paper, we present an overview of the autoML field with a particular emphasis on the automated methods and strategies that have been proposed for unsupervised anomaly detection
    corecore